STATS 32 Session 9: Data joining and maps

Kenneth Tay

Oct 22, 2019

Recap of week 4

Agenda for today

Making maps in R: maps package

Example: Drawing CA and its counties (1)

library(tidyverse)
library(maps)
county_data <- map_data("county")
CA_data <- county_data %>% filter(region == "california")
head(CA_data)
##        long      lat group order     region subregion
## 1 -121.4785 37.48290   157  6965 california   alameda
## 2 -121.5129 37.48290   157  6966 california   alameda
## 3 -121.8853 37.48290   157  6967 california   alameda
## 4 -121.8968 37.46571   157  6968 california   alameda
## 5 -121.9254 37.45998   157  6969 california   alameda
## 6 -121.9483 37.47717   157  6970 california   alameda

Example: Drawing CA and its counties (2)

County outlines are drawn using geom_polygon.

ggplot(data = CA_data) + 
    geom_polygon(mapping = aes(x = long, y = lat, group = group))

Example: Drawing CA and its counties (3)

coord_quickmap() preserves the aspect ratio of the map.

ggplot(data = CA_data) + 
    geom_polygon(mapping = aes(x = long, y = lat, group = group)) + 
    coord_quickmap()

Example: Drought levels in CA (1)

drought_data <- read_csv("Drought data.csv")
head(drought_data, n = 3)
## # A tibble: 3 x 2
##   County  Drought_percent
##   <chr>             <dbl>
## 1 alameda             100
## 2 alpine              100
## 3 amador              100
head(CA_data, n = 3)
##        long     lat group order     region subregion
## 1 -121.4785 37.4829   157  6965 california   alameda
## 2 -121.5129 37.4829   157  6966 california   alameda
## 3 -121.8853 37.4829   157  6967 california   alameda

Our drought data and mapping information are in different datasets!

Joining datasets

Sometimes our data are spread across different datasets, making it difficult to answer some questions.

Question: Who scored the highest in English in each class?

bio
##     Name Class
## 1 Andrew     A
## 2   John     B
## 3   Mary     A
## 4   Jane     B
scores
##     Name Subject Score
## 1   John English    76
## 2 Andrew English    66
## 3   John    Math    85
## 4   Mary English    71

Joining datasets (using dplyr)

library(dplyr)
left_join(bio, scores, by = "Name")

Joining datasets (using dplyr)

library(dplyr)
left_join(bio, scores, by = "Name")
##     Name Class Subject Score
## 1 Andrew     A English    66
## 2   John     B English    76
## 3   John     B    Math    85
## 4   Mary     A English    71
## 5   Jane     B    <NA>    NA

Joining datasets (using dplyr)

Question: Who scored the highest in English in each class?

library(dplyr)
bio %>% left_join(scores, by = "Name") %>%
    filter(Subject == "English") %>%
    group_by(Class) %>%
    top_n(1, Score)
## # A tibble: 2 x 4
## # Groups:   Class [2]
##   Name  Class Subject Score
##   <chr> <chr> <chr>   <dbl>
## 1 John  B     English    76
## 2 Mary  A     English    71

Example: Drought levels in CA (1)

drought_data <- read_csv("Drought data.csv")
head(drought_data, n = 3)
## # A tibble: 3 x 2
##   County  Drought_percent
##   <chr>             <dbl>
## 1 alameda             100
## 2 alpine              100
## 3 amador              100
head(CA_data, n = 3)
##        long     lat group order     region subregion
## 1 -121.4785 37.4829   157  6965 california   alameda
## 2 -121.5129 37.4829   157  6966 california   alameda
## 3 -121.8853 37.4829   157  6967 california   alameda

Our drought data and mapping information are in different datasets!

Example: Drought levels in CA (2)

Solution: Join the datasets together.

combined_data <- CA_data %>%
    left_join(drought_data, by = c("subregion" = "County"))
head(combined_data)
##        long      lat group order     region subregion Drought_percent
## 1 -121.4785 37.48290   157  6965 california   alameda             100
## 2 -121.5129 37.48290   157  6966 california   alameda             100
## 3 -121.8853 37.48290   157  6967 california   alameda             100
## 4 -121.8968 37.46571   157  6968 california   alameda             100
## 5 -121.9254 37.45998   157  6969 california   alameda             100
## 6 -121.9483 37.47717   157  6970 california   alameda             100

Example: Drought levels in CA (3)

Map the fill attribute of geom_polygon to the Drought_percent column.

ggplot(data = combined_data) + 
    geom_polygon(mapping = aes(x = long, y = lat, 
            group = group, fill = Drought_percent)) + 
    coord_quickmap()

Example: Drought levels in CA (4)

Use scale_fill_distiller to define a more appropriate color scale.

ggplot(data = combined_data) + 
    geom_polygon(mapping = aes(x = long, y = lat, 
            group = group, fill = Drought_percent)) + 
    scale_fill_distiller(palette = "YlOrRd", direction = 1) +
    coord_quickmap()

Today’s dataset: 2016 US Presidential Elections









Optional material

Other types of joins

Inner join: Matches pairs of observations with equal keys, drops everything else. Hence, only keeps observations which appear in both datasets.

Other types of joins

After matching pairs of observations with equal keys…